4 research outputs found

    Performance Evaluation of Blocking and Non-Blocking Concurrent Queues on GPUs

    Get PDF
    The efficiency of concurrent data structures is crucial to the performance of multi-threaded programs in shared-memory systems. The arbitrary execution of concurrent threads, however, can result in an incorrect behavior of these data structures. Graphics Processing Units (GPUs) have appeared as a powerful platform for high-performance computing. As regular data-parallel computations are straightforward to implement on traditional CPU architectures, it is challenging to implement them in a SIMD environment in the presence of thousands of active threads on GPU architectures. In this thesis, we implement a concurrent queue data structure and evaluate its performance on GPUs to understand how it behaves in a massively-parallel GPU environment. We implement both blocking and non-blocking approaches and compare their performance and behavior using both micro-benchmark and real-world application. We provide a complete evaluation and analysis of our implementations on an AMD Radeon R7 GPU. Our experiment shows that non-blocking approach outperforms blocking approach by up to 15.1 times when sufficient thread-level parallelism is present

    Energy and Area Efficient Machine Learning Architectures using Spin-Based Neurons

    Get PDF
    Recently, spintronic devices with low energy barrier nanomagnets such as spin orbit torque-Magnetic Tunnel Junctions (SOT-MTJs) and embedded magnetoresistive random access memory (MRAM) devices are being leveraged as a natural building block to provide probabilistic sigmoidal activation functions for RBMs. In this dissertation research, we use the Probabilistic Inference Network Simulator (PIN-Sim) to realize a circuit-level implementation of deep belief networks (DBNs) using memristive crossbars as weighted connections and embedded MRAM-based neurons as activation functions. Herein, a probabilistic interpolation recoder (PIR) circuit is developed for DBNs with probabilistic spin logic (p-bit)-based neurons to interpolate the probabilistic output of the neurons in the last hidden layer which are representing different output classes. Moreover, the impact of reducing the Magnetic Tunnel Junction\u27s (MTJ\u27s) energy barrier is assessed and optimized for the resulting stochasticity present in the learning system. In p-bit based DBNs, different defects such as variation of the nanomagnet thickness can undermine functionality by decreasing the fluctuation speed of the p-bit realized using a nanomagnet. A method is developed and refined to control the fluctuation frequency of the output of a p-bit device by employing a feedback mechanism. The feedback can alleviate this process variation sensitivity of p-bit based DBNs. This compact and low complexity method which is presented by introducing the self-compensating circuit can alleviate the influences of process variation in fabrication and practical implementation. Furthermore, this research presents an innovative image recognition technique for MNIST dataset on the basis of p-bit-based DBNs and TSK rule-based fuzzy systems. The proposed DBN-fuzzy system is introduced to benefit from low energy and area consumption of p-bit-based DBNs and high accuracy of TSK rule-based fuzzy systems. This system initially recognizes the top results through the p-bit-based DBN and then, the fuzzy system is employed to attain the top-1 recognition results from the obtained top outputs. Simulation results exhibit that a DBN-Fuzzy neural network not only has lower energy and area consumption than bigger DBN topologies while also achieving higher accuracy

    Dynamic Temperature Aware Scheduling for CPU-GPU 3D Multicore Processor with Regression Predictor

    No full text
    The 3D stacked integration of CPU, GPU and DRAM dies is a rising horizon in chip fabrication, where dies are vertically interconnected by TSVs (Through-Silicon Vias) to achieve high bandwidth, low latency and power consumption. However, thinned substrate, high power density and low thermal conductivity of inter-layer dielectric material cause thermal management a crucial problem. Moreover, the vertically stacked dies are susceptible to tight thermal correlations. High temperatures which tend to show higher spatial/temporal localities can make a negative impact on the IC’s reliability and lifetime. To mitigate such problems on CPU-GPU 3D heterogeneous processors, a novel dynamic temperature-aware task scheduling approach for compute workloads using OpenCL framework is proposed in this work. The proposed scheduler predicts the future temperature of each core from a regression model based on its current temperature, the neighbors’ temperatures and the execution profile of each workgroup. The scheduler then selects a core to assign workgroups from task queue based on their predicted temperature to keep the 3D chip below certain threshold temperature. Our experimental results demonstrate that the proposed scheduling technique is a viable solution to address the hotspots and heat dissipation issue of 3D stacked heterogeneous processors under reasonable performance tradeoffs
    corecore